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Abstract 

This paper considers linear-quadratic control of a non-linear dynamical system subject to arbitrary 
cost. I show that for this class of stochastic control problems the non-linear Hamilton-Jacobi-Bellman 
equation can be transformed into a linear equation. The transformation is similar to the transforma- 
tion used to relate the classical Hamilton- Jacobi equation to the Schrodinger equation. As a result of 
the linearity, the usual backward computation can be replaced by a forward diffusion process, that can 
be computed by stochastic integration or by the evaluation of a path integral. It is shown, how in the 
deterministic limit the PMP formalism is recovered. The significance of the path integral approach 
is that it forms the basis for a number of efficient computational methods, such as MC sampling, the 
Laplace approximation and the variational approximation. We show the effectiveness of the first two 
methods in number of examples. Examples are given that show the qualitative difference between 
stochastic and deterministic control and the occurrence of symmetry breaking as a function of the 



1 Introduction 

The problem of optimal control of non-linear systems in the presence of noise occurs in many areas 
of science and engineering. Examples are the control of movement in biological systems, robotics, and 
financial investment policies. 

In the absence of noise, the optimal control problem can be solved in two ways: using the Pontrya- 
gin Minimum Principle (PMP) 1 which is a pair of ordinary differential equations that are similar to 
the Hamilton equations of motion or the Hamilton-Jacobi-Bellman (HJB) equation which is a partial 
differential equation [2]. 

In the presence of Wiener noise, the PMP formalism can be generalized and yields a set of coupled 
stochastic differential equations, but they become difficult to solve due to the boundary conditions at 
initial and final time (see however [!]])■ In contrast, the inclusion of noise in the HJB framework is 
mathematically quite straight-forward. However, the numerical solution of either the deterministic or 
stochastic HJB equation is in general difficult due to the curse of dimensionality. Therefore, one is 
interested in efficient methods for solving the HJB equation. The class of problems considered below 
allows for such efficient methods. 

In section 13.11 we consider the control of an arbitrary non-linear dynamical system with arbitrary 
cost, but with the restriction, that the control acts linearly on the dynamics and the cost of the control 
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Figure 1: The drunken spider. In the absence of noise (alcohol in this case), the optimal trajectory for 
the spider is to walk over the bridge. When noise is present, there is a significant probability to fall off 
the bridge, incurring a large cost. Thus, the optimal noisy control is to walk around the lake. 

is quadratic. For this class of problems, the non-linear Hamilton- Jacobi-Bellman equation can be trans- 
formed into a linear equation by a log transformation of the cost-to-go. The transformation stems back 
to the early days of quantum mechanics and was first used by Schrodinger to relate the Hamilton-Jacobi 
formalism to the Schrodinger equation. See section [7| for a further discussion on this point. The log 
transform was first used in the context of control theory by ^ (see also 

Due to the linear description, the usual backward integration in time of the HJB equation can be 
replaced by computing expectation values under a forward diffusion process. This is treated in section rOl 
The computation of the expectation value requires a stochastic integration over trajectories that can be 
described by a path integral fsection l3.3|) . This is an integral over all trajectories starting at x, i, weighted 
by exp(— iS/V), where S is the cost of the path (also know as the Action) and v is the size of the noise. 
It has the characteristic form of a partition sum and one should therefore expect that for different values 
of the noise v the control is qualitatively different, and that symmetry breaking occurs below a critical 
value of v. 

In general, control problems may have several solutions, corresponding to the different local minima 
of S. The case is illustrated in fig.^ A spider wants to go home, by either crossing a bridge or by going 
around the lake. In the absence of noise, the route over the bridge is optimal since it is shorter. However, 
the spider just came out of the local bar, where it had been drinking heavily with its friends. He is not 
quite sure about the outcome of its actions: any of its movements may be accompanied by a random 
sway to the left or right. Since the bridge is rather narrow, and spiders don't like swimming, the optimal 
trajectory is now to walk around the lake. Thus, we see that the optimal control in the presence of noise 
can be quantitatively different from the deterministic control. 
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In addition to which path to chose, the spider also has the problem when to make that decision. Far 
away from the lake, he is in no position to chose for the bridge or the detour, as he is still uncertain of 
where his random swaying may bring him. In other words, why would he spend control effort now to 
move left or right when there is a 50 % change that he may wander there by chance? He decides to delay 
his choice until he is closer to the lake. The question is, when should he make his decision to move left 
or right? 

It is in these multi-modal examples, that the difference between deterministic and stochastic control 
becomes most apparent. They are not only of concern to spiders, but occur quite general in obstacle 
avoidance for autonomous systems, differential games, and predator-prey scenarios. Current efficient 
approaches to control are essentially restricted to unimodal situations and therefore cannot address these 
issues. The aim of the present paper is to introduce a class of multimodal control problems that can be 
efficiently solved using path integral methods. 

The path integral formulation is well-known in statistical physics and quantum mechanics, and several 
methods exist to compute them approximately. The Laplace approximation approximates the integral by 
the path of minimal S and is treated in section 0] This approximation is exact in the limit of v — > 0, and 
the deterministic control law is recovered. The formalism is illustrated for the linear quadratic case in 
section Further refinements to the Laplace approximation can be made by considering the quadratic 
fluctuations around the deterministic solution (also know as the semi-classical approximation), but I 
believe that this correction has a small effect on the control (it does strongly affect the value of J but 
not its gradient). The semi-classical approximation is not treated in this paper. 

As is shown in section 14.31 in the Laplace approximation the optimal stochastic control becomes a 
mixture of deterministic control strategies, weighted by exp(— S/is) and can be computed efficiently, The 
path integral displays a symmetry breaking at a critical value of v: For large v, the optimal control is 
the average of the deterministic controls. For small is, one of the deterministic controls is chosen. In 
section 16. 1. 21 we give the example of the delayed choice problem that displays such symmetry breaking 
as a function of the time to reach the target. 

In general, the Laplace approximation may not be sufficiently accurate. Possibly the simplest alterna- 
tive is Monte Carlo (MC) sampling. The naive sampling procedure proposed by the theory is presented in 
section l5.ll but is shown to be rather inefficient in the double slit example in section l6~Tl It is not difficult 
to devise more efficient samplers. In section 15.21 we propose an importance sampling scheme, where the 
sampling distribution is a (mixture of) diffusion processes with drift given by the Laplace deterministic 
trajectories. The importance sampling method is compared with the exact results for the double slit 
problem in section fa. T. II In section fo. 21 we compute the optimal control for the drunken spider for low 
noise using the Laplace approximation and for high noise using MC importance sampling. 

We begin our story with a brief derivation of the HJB equation for stochastic optimal control, which 
is treated in depth in many good textbooks (see for instance 0E1E])- 

2 Stochastic optimal control 

Consider the stochastic differential equation 

dx = b(x(t),u(t),t)dt + d£. (I) 

x, b, dt; and dx are n-dimensional vectors and u is an m-dimensional vector of controls, dt; is a Wiener 
processes with (d£kd£i) — i>ki(x,u,t)dt. The initial state of x is fixed: x(U) = Xi and the state at final 
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time tf is free. The problem is to find a control trajectory u(t), ti < t < tf, such that 

C( Xi ,ti,u(-)) = (&(x(t f ))+£ f dtf (x(t),u(t),t)j (2) 

is minimal. The subscript Xi on the expectation value is to remind us that the expectation is over all 
stochastic trajectories that start in Xi. 

The standard construction of the solution for this problem is to set up a partial differential equation 
that is to be solved for all times in the interval ti to tf and for all x. For this purpose, we define the 
optimal cost-to-go function from any intermediate time t and state x: 



J(x,t) = min C{x,t,u{t — > tf)) (3) 

where u(t — > tf) denotes the sequence of controls u(-) on the time interval [i,t/]. For any intermediate 
time t',t < t' < tj we can write a recursive formula for J in the following way: 

J(x,t) = min U(x(t f ))+ [ dtf (x(t),u(t),t)+ f ' dtf (x(t),u(t),t)\ 
"(*-»*/) \ Jt Jt' / 



mm 



/' 



u(t->*') \Jt U(t'->tf) 



dtf (x(t),u(t),t) + min U(x(t f ))+ / dtf (x(t),u(t),t) 



x(t'). 



= min// dtf {x(t),u(t),t) + J(x(t'),t')\ (4) 

«(*->*') \7t / 

The first line is just the definition of J. In the second line, we split the minimization over two intervals. 
These are not independent, because the second minimization is conditioned on the starting value x(t'), 
which depends on the outcome of the first minimization. The last line uses again the definition of J. 

Setting t' = t + dt we can Taylor expand J(x(t'),t') around t. This expansion takes place within the 
expectation value and need to be performed to first order in dt and second order in dx, since (da; 2 ) = 
O(dt). This is the standard Ito calculus argument. Thus, 

(J{x(t + dt),t + dt)) x = (j{x,t) +d t J(x,t)dt + (d x J(x,t)) T dx+ ^Tr (<9 2 J(x,t)dx 2 )^ 

= J(x, t) + d t J(x, t)dt + (d x J(x, t)) T b(x, u, t)dt + ^Tr (<9 2 J(x, t)v(x, u, t)) dt 

In this expression, dt and d x denotes partial differentiation with respect to t and x, respectively. Similarly, 
d% J is the matrix of second derivatives of J and Tr(vd x J) = ■ Vij q®.q x - Substituting this into Eq.0] 
dividing both sides by dt and taking the limit of dt — > yields 

-d t J(x,t) = min (f (x,u,t) + b(x,u,t) T d x J(x,t) + ^Tr (v(x,u,t)dl J {x,t))^j , \/t,x (5) 

which is the Stochastic Hamilton- J acobi- Bellman Equation with boundary condition J(x,tf) = (j){x). 
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Eq. \5\ reduces to the deterministic HJB equation in the limit v — > 0. In that case, an alternative 
approach to solving the control problem is the Pontryagin Maximum principle (PMP), which requires 
the solution of 2n ordinary differential equations. These equations need to be solved with multi-point 
boundary conditions at both £j and tf. Solving 2n ordinary differential equations may be more efficient 
than solving the n-dimensional partial differential equation, using shooting methods (see for instance 7 ), 
but may be unstable in some cases. 

In the stochastic case, there does not exist a generic alternative to solving the pde (see however [3j 
for stochastic versions of the PMP approach). Thus, for stochastic control one needs to solve the HJB 
equation, which suffers from the curse of dimensionality. 

A notable exception is when b is linear in x and u and fo is quadratic in x and u. This is called 
the linear-quadratic (LQ) control problem. In that case, it can be shown that the solution for J(x,t) 
is quadratic in x with time-varying coefficients. These coefficients satisfy coupled ordinary differential 
(Ricatti) equations that can be solved efficiently 0. 

3 A path integral formulation for control 
3.1 A linear HJB equation 

Consider the special case of Eqs.^ an d El where the dynamic is linear in u and the cost is quadratic in u: 

dx = (b(x, t) + Bu)dt + d£ (6) 



C{xi,U,u{-)) = (4>(x(t f ))+J dt^-u(tyRu(t)+V(x(t),t)jJ (7) 

with B an n x m matrix and R an m x m matrix. B, R and v are independent of x, u, t. b and V 
are arbitrary functions of x and t and 4> is an arbitrary function of x. In other words, the system to 
be controlled can be arbitrary complex and subject to arbitrary complex costs. The control instead, is 
restricted to the simple LQ form. 

The stochastic HJB equation |51 becomes 

-d t J = min (^u T Ru + V + (b + Bu) T d x J + ^Tr 

Minimization with respect to u yields: 

u= -R- 1 B T d x J(x,t) (8) 
which defines the optimal control u for each x, t. The HJB equation becomes 

-d t J = -^(d x J) T BR~ 1 B T d x J + V + b T d x J+^Tr^d 2 x J) 

This partial differential equation must be solved with boundary condition J{x,tt) — 4>{x). Note, that 
after performing the minimization with respect to to u, the HJB equation has become non-linear in J. 

We can remove the non-linearity and this will turn out to greatly help us to solve the HJB equation. 
Define ip(x,t) through J(x,t) = —\\ogip(x,t), with A a constant to be defined. Then 

~(d x J) T BR- 1 B T d x J + ^Tr (vd 2 x J) 
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ij ij ij ■> 

The terms quadratic in ip vanish if and only if there exists a scalar A such that 

v = \BR~ 1 B T (9) 

In other words, the matrices v and BR~ l B T must be proportional to each other with proportionality 
constant A. In the one dimensional case, such a A always exists, and Eq. [5] is not a restriction. In the 
higher dimensional case, Eq.EI restricts the possible choices for the matrices R and v. To get an intuition 
for this restriction, consider the case that u and x have the same dimension, B is the identity matrix and 
both R and v are diagonal matrices. Then Eq. states R cx v^ 1 . In a direction with low noise, control 
is expensive (Ru large) and only small control steps are permitted. In the limiting case of no noise, we 
deduce that u should be set to zero: no control is allowed in noiseless directions. In noisy directions the 
reverse is true: control is cheap and large control values are permitted. Loosely speaking, Eq. states 
that noise and control should operate in the same dimensions. 1 

When Eq. holds, the quadratic terms in the HJB equation cancel and the HJB becomes 

= -Hip (10) 

with H a linear operator acting on the function ip. Eq. 1101 must be solved backwards in time with 
ip{x,tf) = exp(— <p(x)/\). However, the linearity allows us to reverse the direction of computation, 
replacing it by a diffusion process, as we will explain in the next section. 

To simplify the exposure in the subsequent sections, we assume the control dimension m — n and B 
the unit matrix. 

3.2 Forward diffusion 

For real functions p and if>, define the inner product (p\i/j) = J dxp(x,t)ij)(x,t). Then we can define H\ 
the Hermitian conjugate of the operator H , with respect to this inner product as follows. 

(fftp^) - ( P \Hip) = J dx P {x,t) ^-ZM + fc^^ + iE^a^j ^) 

1 As a natural example, consider a one-dimensional second order system subject to additive control 9 = f(B, t) + u. The 
first order formulation is obtained by setting x\ = 9 and X2 = 9. Then 

dxi = (bi(x, t) + Biu)dt, i = l,2 

with bi(x,t) = X2, b2(x,t) = f{x\,t) and B = (0, 1) T . Since u is one-dimensional, R is a scalar and 

Condition Eq. ^states that the stochastic dynamics must have the noise restricted to the second component only: 

dxi = (bi (x,t) + Biu)dt + d^5i } 2, i = 1,2 

with ((ft; 2 ) = vdt and A = vR. 



dx — 



V ^ t \ (x,t) - d x (b(x,t)p(x,t)) + \ Yl Vl 3 Qx Q x p ( x ^ ) ^( x >*) 

ij 1 3 



where we have performed integration by parts and assume that p vanishes at |x| — > oo. Thus, 

ijtp = _]^^ p ( a ; jt )_e (B (6( a . )i )p( a;>t )) + I^ I/ij ._^_p( 3 . >t ). 

ij 1 3 

Let p(y,r\x, t) be a probability density, initialized at t,x, that evolves forward in time according to the 
diffusion process 

d tP = H^p (11) 

with drift b(x, t)dt and diffusion d£, and with an extra term due to the potential V. Whereas the other two 
terms conserve probability density, the potential term takes out probability density at a rate V(x, t)dt/\. 
Therefore, the stochastic simulation of Eq. lit I is a diffusion that runs in parallel with the annihilation 
process: 

dx = b(x, t)dt + d£ 
x = x + dx, with probability 1 — Vix, t)dt/\ 

xi = f, with probability V(x,t)dt/X (12) 

where f denotes that the particle is taken out of the simulation. Note that when V = this diffusion 
process is identical to the original control dynamics Eq. [|J]in the absence of control (u = 0). 

Since tp evolves backwards in time according to H and p evolves forwards in time according to the 
inner product J dyp{y,T\x,t)ip{y,T) is time invariant (independent of r). Since p(y,t\x,t) — 5(y — x), it 
immediately follows that 

ip(x,t) = J dyp(y,tf\x,i)ip{y,tf) (13) 

We arrive at the important conclusion that ip(x,t) can be computed either by backward integration using 
Eq. or by forward integration of a diffusion process given by Eq. El The optimal cost-to-go is finally 
given by 

J(x,t) = -Mog J dyp{y,t f \x,t)exp(-cj>(y)/\) (14) 

with p(y,tf\x,t) given by the stochastic process Eq. The optimal control is given by Eq. |SJ See 
section ET2*1 for a simple Gaussian example that illustrate these ideas. 

3.3 The path integral formulation 

In this section, we will write the diffusion kernel p(y, tf\x, t) in Eq. ^] as a path integral. For an 
infinitesimal time step e, we can write the probability to go from x to y as an integral over all noise 
realizations. The probability of the Wiener is Gaussian with mean zero and variance ve. The particle 
annihilation destroys probability with rate V{x,t)e/ X. Combining annihilation with diffusion, we obtain 



p(y, t + e\x, t) cx exp | -~ 



\ {~~r~ - h: j - /: ) R ( " t] s * 1 " ! ; ! 



7 



where we have used v 1 = R/X. 

We can write the transition probability as a product of n infinitesimal transition probabilities: 



p(y,tf\x,t) oc dxi...dx n -i 



n-1 




In the limit of e — > 0, the sum in the exponent becomes an integral: cY^i=o ~ * it' ^ r anc ^ thus we can 
formally write 

p(y,t f \x,t) = J[dx]lexp (-jS path (x(t -> t f ))\ (15) 

VthW* - */)) = ^ dr (± (^jjjl - b{x{r), r)) T JZ - 6(x(r), r)) + W>r)j (16) 

with x(t — > tf) a path with x(r = t) = x,x(r = tj) = y, J[dx\}j. an integral over paths that start at x 
and end at y. 2 

Substituting Eq.[T3in Eg. ITU we can absorb the integration over y in the path integral and find 



(17) 



J(x,t) = -MogJ[dx] x exp f-^S(x(t -> tf)) 

where the path integral J[dx] x is over all trajectories starting at x and 

S(x(t^t f )) = cj>(x(t f )) + S path (x(t -» t f )) (18) 

is the Action associated with a path. 

The path integral Eq. El is a log partition sum and therefore can be interpreted as a free energy. 
The partition sum is not over configurations, but over trajectories. S(x(t — > tf)) plays the role of the 
energy of a trajectory and A is the temperature. This link between stochastic optimal control and a free 
energy has two immediate consequences. 1) Phenomena that allow for a free energy description, typically 
display phase transitions and spontaneous symmetry breaking. What is the meaning of these phenomena 
for optimal control? 2) Since the path integral appears in other branches of physics, such as statistical 
mechanics and quantum mechanics, we can borrow approximation methods from those fields to compute 
the optimal control approximately. First we discuss the small noise limit, where we can use the Laplace 
approximation to recover the PMP formalism for deterministic control. Also, the path integral shows 
us how we can obtain a number of approximate methods: 1) one can combine multiple deterministic 
trajectories to compute the optimal stochastic control 2) one can use a variational method, replacing the 
intractable sum by a tractable sum over a variational distribution and 3) one can design improvements 
to the naive MC sampling. 

2 The paths are continuous but non-differential and there are different forward are backward derivatives 151151 . Therefore, 
the continuous time description of the path integral and in particular x are best viewed as a shorthand for its finite n 
description. 
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4 The Laplace approximation 
4.1 The Laplace approximation 

When A is small (i.e. v is small), we can expand an arbitrary path x(t) around the classical path: 

x(t) = x{t) + 5(t), t<T<tf 

where x(t) is the classical path that we need to determine, and 5(t) is an independent fluctuation of the 
path at time r. Fluctuations are also allowed at t = t and r = tf. The Action Eq. ^|can be expanded 
to first order in S(t) as 

dr (i±(T)-b(x,T))iRij (j^S^-Ski^dkb^T^+SiWdiVixiT),^ 

S(x(t -> tf))+6i(t f ) {d^{x{tf))+ P] {t f ))-p 3 (t)5 3 {t) 
' d 



dTS k ( T ) \^Pk{r) + Pj (T)d k b j {x,T)-d k V{x(r),T)j (19) 

where d k means partial differentiation with respect to x k , repeated indices are summed over and p is 
defined as 

p k (t) = (x(t)-b(x,t)) j R jk (20) 

The term proportional to Sk(r) under the integral must be zero and defines an ODE for the classical 
trajectory: 

j t Pk{t) + ^ fe(*)&i(z,f) - V(x,t)) = (21) 

Eq. 1201 can be seen as a definition of p, but also as a dynamical equation for x that must be solved 
together with the dynamical equation for p, Eq. 1211 These equations must be solved with boundary 
conditions. The boundary condition for x is given at initial time and the term proportional to Si(tf) 
defines the boundary condition for p(t) at t = tf. 

x t (t) = x, ft(t/ ) = -^M (22) 

Define the Hamiltonian, 

H (x, p, t) = \p T R~ x p + P T b( Xl t) - V(x, t) (23) 

Then, Eos. I20landl21lcan be written as 

dx dH(x,p,t) dp dH(x,p,t) 
dt dp dt dx 



(24) 



The Hamiltonian system Eas. l24l with the mixed boundary conditions Eas. l22l are the well-known ordinary 
differential equations of the Pontryagin Maximum Principle. 
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In the Laplace approximation, the path integral Eq. 1171 is replaced by the classical trajectory only. 
Thus, 



J(t,x) « S{x(t -> t f )) 
since fluctuations at initial time are zero: Si(t) = 0. The optimal control is given by 

u = -R- l 8 x J m = iJ-l p (t) = ±(t) - b(x(t),t) (25) 

ox(t) 

where we have used ^^""^^ = —p(t) from Eq. 1191 The intuition of the Laplace approximation is that 
one needs to solve the deterministic equations for the whole interval [t, tf], starting at the current place 
x. In particular, the end boundary condition (the location of the target) will affect the location of the 
optimal path for all [t — * tf]. The control is then given by the value of the pseudo-gradient x(t) — b{x{t), t) 
on this trajectory. 

Note the minus sign in front of V in Eq. [23 which has the opposite sign from a normal classical 
mechanical system. The term ^p T R~ 1 p can be interpreted as the kinetic energy of the system. Thus, 
the 'energy' H is not the sum, but the difference of kinetic and potential energy. When H does not 
explicitly depend on time (b(x,t) = b(x) and V(x, t) = V(x)), H is conserved under the deterministic 
control dynamics: 

dH _ dH dx dH dp _ Q 
dt dx dt dp dt 

because of Eqs. El To understand this behavior, consider 6 = 0. Then along the trajectory: 

^u T Ru = V(x) + H 

with H independent of time. This relation states that the optimal trajectory is such that much control 
is spent in areas of large cost and little control is spent in areas of low cost. 

Note, that the optimal control is independent of the noise v as we expect from the Laplace approx- 
imation. Numerically, we can compute the classical trajectory by discretizing x c \(t) — Xi,...,x n and 
minimizing S(x c \) — S(xx, . . . , x n ) using a standard minimization method. 

4.2 The linear quadratic case 

To build a bit of intuition for the diffusion process, the path integral and Laplace approximation, we 
consider in this section some simple one-dimensional linear quadratic examples. 
First consider the simplest case of free diffusion: 

V(x,t) = 0, b(x,t)=0, c/)(x) = ^ax 2 

In this case, the forward diffusion described by Eq. ^2 and ^| can be solved in closed form and is given 
by a Gaussian with variance a 2 — v(tf — t): 

^,MM) = ^ex P (-^) (26) 
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Since the end cost is quadratic, the optimal cost-to-go Eq. ^] can be computed exactly as well. The 
result is 



■ \ 1 CT 2 

J(x,t) = vR\og\ — )+--^ax 2 (27) 



with l/uf = 1/c 2 + a/vR. The optimal control is computed from Eq. [S] 

a 2 ^ R + a(t f -t) 



u = -R- l 8 x J = -R-^ax = 



We see that the control attracts x to the origin with a force that increases with t getting closer to tf. 
Note, that the optimal control is independent of the noise v. This is a general property of LQ control. 

As an extension, we now add a quadratic potential to the above problem: V(x) — \[ix 2 . We now 
compute the optimal control in the Laplace approximation. The Hamiltonian is given by Eq. 1231 

H(x,p) = ^R- 1 p 2 -\yx 2 
and the equations of motion and boundary conditions are given by Eas. 1241 and 1221 

x = p/R p = fix 
x(t) = x p(tf) — ~ ax (tf) 

We can write this as the second order system in terms of x only: 

x = iu,x/R, x{t) = x x(tf) = —ax(tf)/R 

The solution for t < t < tf is 

x(t) = Ae^^ + £? e -\/^-*) 



The boundary conditions become A + B = x and A r y(y/ fi/R + a/R) = B/j(WJl/R — ct/R), 7 = 
e V W 'R(t f -t) f rom w hich we can solve A and B. The classical Action Eq. ^| is computed by substi- 
tuting the solution for x : 

2 _ y/^R-a 

1 ' 



S(z(t-t/)) = -ax(t f ) 2 + ^l d T (Rx 2 {T)+vx 2 (T)) = ^nRx 2 — 



7 2 _j_ yV*-° 



nR+a 

which is equal to the cost-to-go in the Laplace approximation. The optimal control is minus the gradient 
of the cost-to-go. Note, that the classical trajectory as well as the minimal action only depends on the 
initial condition x and the time-to-go tf — t. For pure diffusion (// — > 0) the classical Action reduces to 

S(x(t -> tf)) = ; 

v v 2R + a(t f -t) 

which is identical to the exact expression Eq. |57| except for the volume factor (which does not affect the 
control, since it does not depend on x). 
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4.3 The multi-modal Laplace approximation 



The Action S in Eq. 1171 may have more than one local minimum. This is typical for control problems, 
where "many roads lead to Rome". Let x a {t — > tf),a = 1, . . . denote the different optimal deterministic 
trajectories that we compute by minimizing the Action: 

x a (t tf) = argmm x(t ^ tf) S{x(t -> t f )), a = 1, . . . 

These trajectories all start at the same value x. In our drunken spider example, there are two trajectories: 
one is over the bridge and the other is around the lake. Then, in the Laplace approximation the path 
integral Eq. 1171 is approximated by these local minima contributions only: 

J(x,t) » -A log ^ exp(— S(x a (t — > tf)/X) (28) 

a 

The Laplace approximation ignores all fluctuations around the mode. Although these fluctuations can be 
quite big, their x dependence is typically quite weak and must come from beyond Gaussian corrections. 
This can be seen from the pure LQ case when the Gaussian fluctuation term in Ea. l27l is independent of x. 
In the LQ case, the Laplace approximation for the control (not for the cost-to-go) coincides with the exact 
solution. Therefore, for unimodal problems (S has only one minimum) one can often safely ignore the 
contribution of fluctuations to the control. However, for multi-modal problems these fluctuation terms 
may have a strong a dependence (they have in the spider problem) and therefore play an important role 
when weighting the different contributions in Eq. 1281 

The optimal control becomes a soft-max of deterministic strategies 

-R^ 1 ^w a d x S(x a (t -> t f ) 

OC 

e -S(x a (t-H f )/\ 

where v plays the role of the temperature. 

5 MC sampling 

A natural method for computing the optimal control is by stochastic sampling. However, as is often the 
case with MC sampling, a naive sampler such as the one based directly on Eas. ll2l mav be very inefficient. 
In this section, we show how this naive sampler works and how it can be improved using importance 
sampling. 

5.1 Naive MC sampling 

The stochastic evaluation of Eq. ^| consists of running N times the diffusion process Eq. ^| from t to tf 
initialized each time at x(t) = x. Denote these N trajectories by Xi(t — > tf), i = 1, . . . , N. Then, ip(x, t) 
is estimated by 

i>(x,t)= w ^ = — exp(-<£(xi(*/))/A) (29) 

z£ alive 



u(x,t) = 

W Q = 
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where 'alive' denotes the subset of trajectories that do not get killed along the way by the f operation. 
Note that, although the sum is typically over less than N trajectories, the normalization 1/N includes 
all trajectories in order to take the annihilation process properly into account. 

The computation of u requires the gradient of ip{x, t) instead of ip itself. First note, that when we 
vary the initial point of a path x(t — > tf) from Eg. 1191 and I2UI we obtain 

Thus combining Eq. [S] and Eq. El we obtain 

1 



tp(x,t) 



[dx] x {x{t)-b(x,t))exp(-S/X) 



Note, that we can sample u by the same batch of (naive) trajectories. For each trajectory, the quantity 
x(t) — b(x, t) is proportional to the realisation of the noise in the initial time t: x(t) — b(x, t) = d£,i(t)/dt. 
Therefore, 



udt = -r-^ — Wid^(t) (30) 

V 1 5 ^) ^alivc 

with Wi given by Eq. 1291 This expression has a particular intuitive form. The optimal control at time t 
is obtained by averaging the initial noise directions of the trajectories d£i(t), weighted by their success 
Wi at the final time tf. 



5.2 Importance sampling 

The sampling procedure as described by Eas. 1121 and 1291 gives an unbiased estimate of ip{x,t) but can 
be quite inefficient. The problem is is well known, and one of the simplest procedures for improving the 
sampling is by importance sampling. For path integrals this works as follows. We replace the diffusion 
process that yields p(y,tf\x,t) with Action S^ath fEas.lTKlandlTf))) by another diffusion process, that will 
yield p'(y, tf\x, t) with corresponding Action S' patil . Then, 

ip(x,t) = J[dx] x exp(-S pat h/X)exp(-(j)/\) 

= J [dx] x exp (-Spath/A) exp (-(0 + S^th - S'p ath )/A) 

The idea is to chose the diffusion process p' such as to make the sampling of the path integral as efficient 
as possible. 

A suggestion that comes to mind immediately is to use the Laplace approximation to compute a 
deterministic control trajectory x*(t — > tf). From this, compute its derivative x*{t — > tf) and define a 
stochastic process to sample p' according to 

dx = x*(t)dt + d£ 
x = x + dx, with probability 1 — V(x, t)dt/X 

Xi = f, with probability V(x,t)dt/\ (31) 
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The Action S' path for the Laplace-guided diffusion is given by Ea.HSlwith b(x(r), r) = x*(t), t < t <tf. 
The estimators for ip and u are given again by Ecis. 1291 and 1301 with the difference that 

Wi = ^ exp (- (4>(xi(t f )) + S path ( Xi (t -> t f ))- S' path ( Xi (t ^ t f ))) /A) (32) 

and Xi (t — ► tf) is a trajectory from the sampling process Eq. [2] instead of Eq. We will illustrate the 
effectiveness of this approach in section rfi.il 

6 Numerical examples 

In this section, we introduce some simple one-dimensional examples to illustrate the methods introduced 
in this paper. The first example is a double slit, and is sufficiently simple that we can compute the 
optimal control by forward diffusion in closed form. We use this example to compare the Monte Carlo 
and Laplace approximations to the exact result. Using the double slit example, we show how the optimal 
cost-to-go undergoes symmetry breaking as a function of the noise and/or some other characteristics of 
the problem (in this case the time-to-go). When the targets are still far in the future, the optimal control 
is to 'steer for the middle' and delay the choice to a later time. 

The second example is similar to the first, except that the slit is now of finite thickness, allowing the 
particle to get lost in one of the holes. When one hole is narrow and the other wide, this illustrates the 
drunken spider problem. We use both the Laplace approximation and the the Monte Carlo importance 
sampling to compute the optimal control strategy, for different noise levels. 

6.1 The double slit 

Consider a stochastic particle that moves with constant velocity from t to tf in the horizontal direction 
and where there is deflecting noise in the x direction: 

dx = udt + e?£ 

The cost is given by Eq. 0with <p{x) — \x 2 and V(x, tx) implements a slit at an intermediate time t±, 
t < ti < tf. 

V(x, t\) = 0, a < x < b, c < x < d 
= co, else 

The problem is illustrated in Fig. where the constant motion is in the t direction and the noise and 
control is in the x direction perpendicular to it. 

Eq. El becomes A = vR and the linear HJB becomes: 

**-(i-5*)* 

which we must solve with end condition ip(x,tf) — e~^ x ^ x . 

Solving this equation by means of the forward computation using Eg. 1131 can be done in closed form. 
First consider the easiest case for times t > t\ where we do not have to consider the slits. This is the 
case we have considered before in section ^. 21 and the solution is given by Eg. 1271 with a = 1. 
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Figure 2: (a) The particle moves horizontally with constant velocity from t — 0totf~2 and is deflected 
up or down by noise and control. The end cost 4>{x) — x 2 /2. A double slit is placed at t\ — 1 with openings 
at —6 < x < —4 and 6 < x < 8. Also shown are two example trajectories under optimal control, (b) 
J(x,t) as a function of x for t = 0,0.99, 1.01, 2 as computed from Eq.l2*7landl55l R = 0.1, v= l,dt = 0.02. 



Secondly, consider t < t\. p(y,tf\x,t) can be written as a diffusion from t to tx, times a diffusion from 
t\ to tf integrating over all x in the slits. Substitution in Ea. 1181 we obtain 



b ,.d\ 

2 



a <J c 



ip(x,t) = dy\ + ) dx%exp(-y /2X)p(y,tf\xi,t 1 )p(xi,ti\x,t) 



p(y, tf\xi, ti) is Gaussian and given by Eg. 1261 Therefore, we can perform the integration over y in closed 
form. We are left with an integral over X\ that can be expressed in terms of Error functions. The result 
is 

J(x,t) = vR\og(—\+-^rX 2 ~vR\og-{F{b,x)~F{a 1 x)+F{d 1 x)-F{c 1 x)) (33) 
\o\ ) 2 o z 2 

B(x),\ A _ 1 



with F(x ,x) = Erf (y^(x - ^p)J, A = ^— t + R+ £ f _ tx and B(x) = Eqs. 03 and ESI together 

provide the solution for the control problem in terms of J and we can compute the optimal control from 
Eq.El 

A numerical example for the solution for J(x, t) is shown in fig. [2t>- The two parts of the solution 
(compare t = 0.99 and t — 1.01) are smooth at t = t\ for x in the slits, but discontinuous at t = t% outside 
the slits. For t = 0, the cost-to-go J is higher around the right slit than around the left slit, because the 
right slit is further removed from the optimal target x = and thus requires more control u and/or its 
expected target cost 4> is higher. 
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(a) Sample trajectories 



(b) MC sampling estimate of J(x, 0) 



Figure 3: Monte Carlo sampling of J(x, t = 0) with tp from Eq. ^| for the double slit problem. The 
parameters are as in fig. [21 (a) Sample of trajectories that start at x to estimate J(x,t). Only trajectories 
that pass through a slit contribute to the estimate, (b) MC estimate of J(x, t) — with N = 100000 
trajectories for each x. 

6.1.1 MC sampling 

We assess the quality of the naive MC sampling scheme, as given by Eas. 1121 and 1291 in fig. El where we 
compare J(x, 0) as given by Eq. [33] with the MC estimate Eq.[55| The left figure shows the trajectories of 
the sampling procedure for one particular value of x. Note, the inefficiency of the sampler because most 
of the trajectories are killed at the infinite potential at t = t\. The right figure shows the accuracy of the 
estimate of J(x,0) for all x between —10 and 10 using N = 100000 trajectories. Note, that the number 
of trajectories that are required to obtain accurate results, strongly depends on the value of x and A due 
to the factor exp(— cj)(x)/X) in Eq.^J For high A or low (</>), few samples are required (see the estimates 
around x — —4). For small noise or high (cf>) the estimate is strongly determined by the trajectory with 
minimal <f>(x(tf)) and many samples may be required to reach this x. In other words, sampling becomes 
more accurate for high noise, which is a well-known general feature of sampling. Also, low values of the 
cost-to-go are more easy to sample accurately than high values. This is in a sense fortunate, since the 
objective of the control is to move the particle to lower values of J so that subsequent estimates become 
easier. 

The sampling is of course particularly difficult in this example because of the infinite potential that 
annihilates most of the trajectories. However, similar effects should be observed in general due to the 
multi-modality of the Action. 

We can improve the sampling procedure using the importance sampling procedure outlined in sec- 
tion 15.21 using the Laplace approximation. The Laplace approximation to J requires the computation 
of the optimal deterministic trajectories. In general, one must use some numerical method to compute 
the Laplace approximation, for instance minimizing the Action Eq. [TBI using a time-discretized version 
of the path. In this particular example, however, we can just write down the classical trajectories 'by 
hand'. For each x, there are two trajectories, each being piecewise linear. The Action for each trajectory 
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Figure 4: Comparison of Laplace approximation (dotted line) and Monte Carlo importance sampling 
(solid jagged line) of J(x, t = 0) with exact result Eg. 1331 fsolid smooth line) for the double slit problem. 
The importance sampler used N = 100 trajectories for each x. The parameters are as in fig. [3 



is simply 



1 R R 

Si(x) =2 R J dU ^ = ^ ~ x ) 2 + 2 



al i = 1,2 



since <fi(x(tf)) — V(x(ti),ti) — by construction, = 6 and —4 for the two trajectories, respectively. 
The cost-to-go in the Laplace approximation is given by Eq. |2HI 

^La P iaco(a;, 0) = -vR\og exp — + exp ' 



A / \ A 

For each x, we randomly choose one of the two Laplace approximations with equal probability. We then 
sample according to Eq.|^with x* the selected Laplace approximation and estimate ip using Eg. 1291 and 
weights Eq. 1321 The Laplace approximation and the results of the importance sampler are given in fig. 
We see that the Laplace approximation is quite good for this example, in particular when one takes into 
account that a constant shift in J does not affect the optimal control. The MC importance sampler 
dramatically improves over the naive MC results in fig. El in particular since 1000 times less samples are 
used and is also significantly better than the Laplace approximation. 

6.1.2 The delayed choice 

Finally, we show an example how optimal stochastic control exhibits spontaneous symmetry breaking. 
To simplify the mathematics, consider the double slit problem, when the size of the slits becomes in- 
finitesimally small. Eq. 1331 with o = 1, 6 = 1 + e, c = —1 — e, d = —1 becomes to lowest order in 
e: 



J(x,t) = — -x - j/Tlog2cosli — + const. 
y 1 ' T \2 5 vTJ 

where the constant diverges as O(loge) independent of x and T = t\ — t the time to reach the slits. The 
expression between brackets is a typical free energy with inverse temperature (3 — 1/z/T. It displays a 
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(a) Optimal cost-to-go at different T. 



(b) Sample paths 



Figure 5: (a) Symmetry breaking in J as a function of T implies a 'delayed choice' mechanism for optimal 
stochastic control. When the target is far in the future, the optimal policy is to steer between the targets. 
Only when T < 1/v should one aim for one of the targets, v = R = 1. (b) Sample trajectories (top row) 
and controls (bottom row) under stochastic control Eq. (left column) and deterministic control Eq. l3~4l 
with v — (right column), using identical initial conditions x(t = 0) = and noise realization. 

symmetry breaking at vT — 1 (fig. Et)- For T > 1/v (far in the past) it is best to steer towards x — 
(between the targets) and delay the choice which slit to aim for until later. The reason why this is optimal 
is that from that position the expected diffusion alone of size vT is likely to reach any of the slits without 
control (although it is not clear yet which slit) . Only sufficiently late in time (T < 1/v) should one make 
a choice. The optimal control is given by the gradient of J: 



Figure |SJd depicts two trajectories and their controls under stochastic and deterministic optimal con- 
trol, using the same realization of the noise. Note, that at early times the deterministic control drives x 
away from zero whereas in the stochastic control drives x towards zero and smaller in size. The stochastic 
control maintains x around zero and delays the choice for which slit to aim until T rs 1 . 

The fact that symmetry breaking occurs in terms of the value of vT, is due to the fact that S oc 1/T, 
which in turn is due to the fact that u oc 1/T. Clearly, this will not be true in general. For an arbitrary 
control problem, S does not need to be monotonic in T, which means that in principle control can be 
shifting back and forth several times between the symmetric and the broken mode as T decreases to zero. 

6.2 The drunken spider 

In order to illustrate the drunken spider problem, we change the potential of the double slit problem so 
that it has a finite thickness: V(x, t) = for all t <t\ and t > and for t\ <t <t2- 




(34) 



V{x, t) = 0, a < x <b, c < x < d 
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= oo, else 



(35) 



The problem is illustrated in Fig. and the parameter values are given in the caption. 

The cost-to-go in the Laplace approximation is given by Eq. |2H1 with S(x( a (t — > tf)),a = 1,2 the 
cost of getting home over the bridge or around the lake, respectively. It is plotted as a function of the 
current position x as the solid line in fig. Et, for both v — 0.001 and v = 0.1 (these two curves coincide 
for these values of v, since S/v is so large that the softmax is basically a max). 

In addition, we compute J using importance sampling as outlined in section 15.21 For each x, we run 
hi = 1000 trajectories. For each trajectory, we select randomly one of the two Laplace trajectories with 
equal probability, which we denote by x*(t — > tf). The stochastic trajectory x(t — > tf) is then computed 
from Eq.[«n] It contributes to the partition sum Eq.[2Uwith a weight that is computed by Eq. [321 where 
S pa th{ x (t — > tf)) an d S' path (x(t — » tf)) are given by Eq.^Jwith b(x(r),T) — and 6(x(r),r) = x*(t), 
respectively. 

The results of the MC importance sampling for various x for low noise (y = 0.001) and high noise 
(v = 0.1) are also shown in fig. 0:. The dots are the results of the MC importance sampling at low noise 
and closely follow the Laplace results. Note the discontinuous change in slope at x — —6, which implies 
a discontinuous change in the optimal control value u at that point: For x > —6 the spider steers for 
the bridge, which requires a larger control value than for x < —6 when the optimal trajectory is around 
the lake. Thus, the optimal path is simply given by the shortest path and noise is ignored in these 
considerations. 

The MC estimates for v — 0.1 are indicated by the stars in fig. 0:. Since noise is large, the Laplace 
approximation is not valid, and indeed are very different from the MC estimate. The Laplace approxi- 
mation ignores the effect of deviations from the deterministic trajectory on the Actions . Thus, it does 
not take into account that the spider may wander off the bridge and drowns, which at this level of noise 
will happen with almost probability one and makes Sbridge much larger than <Si a ke- The MC importance 
sampling is guided by trajectories around the lake, that likely survive and by trajectories over the bridge, 
that will likely drown and thus will not contribute to Eq. 1291 The estimate for J is thus dominated 
by trajectories around the lake and the cost-to-go increases with increasing x. Also note, that the MC 
estimate puts the minimum of J not at x = —6 but safely away from the lake, so that spider is not likely 
to fall in the lake on the low side either, and will have a safe journey home. 

7 Discussion 

In this paper, we have addressed the problem of computing stochastic optimal control. The direct solution 
of the HJB equation requires a discretization of space and time. This computation naturally becomes 
intractable in both memory requirement and cpu time in high dimensions. We have shown, that for a 
certain class of problems the control can be computed by a path integral. The class of problems includes 
arbitrary dynamical systems, but with a limited control mechanism. It includes LQ control as a special 
case. The path integral approach has the advantage that the n-dimensional x-space integration of the 
HJB equation is replaced by an n-dimensional sampling problem. For high-dimensional problems, a 
stochastic integration method is expected to be much more efficient than numerical integration of the 
HJB equation directly, which scales exponentially in n. 

The obvious approximation methods to use are the Laplace approximation, the variational approxi- 
mation and MC sampling. The Laplace approximation is very efficient. The deterministic trajectories are 
found by minimizing the action, which can be done by standard numerical methods. It typically requires 
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Figure 6: The drunken spider problem. A spider located at x and t = — 1 wants to arrive home (x = 0) 
at time tt. The lake is indicated by the white square area, interrupted by a narrow bridge. The 
lake is modelled by the infinite potential given by Eq. 1351 with —a = b = 0.1, c = — oo and d = —6. 
ii = Q, t% = 4, tf = 5 and R = 1. The cost-to-go is computed by forward importance sampling as outlined 
in section l5.2l The guiding Laplace approximations are the deterministic trajectories over the bridge and 
around the lake. Time discretization dt = 0.012. (a) Some stochastic trajectories used to compute J for 
v = 0.001. (b) Some stochastic trajectories used to compute J for v = 0.1. (c) The optimal cost-to-go 
J(x,t) in the Laplace approximation for v = 0.001 and v = 0.1 solid line (these two curves coincide). 
The MC importance sampling estimates are based on 1000 trajectories per x for v = 0.001 (dots) and 
for v = 0.1 (stars). 
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0{n 2 k 2 ) operations, where n is the dimension of the problem and k is the number of time discretizations. 
We have seen that the multi-modal Laplace approximation gives non-trivial solutions involving symmetry 
breaking. 

Computing the path integral by MC sampling is clearly a very generic approach, that for many 
practical control applications may well be the best way to go. Naive sampling should be replaced by 
more advanced sampling schemes. I have only considered one simple improvement using importance 
sampling. Other possible improvements could be a Gibbs sampler or a Metropolis-Hasting sampler. 
Clearly, more work in this direction must be done. 

In this paper we have numerically computed the path integrals using the most simple discretization 
strategy: short time averaging |1(J|. The computation can be made much more efficient using Fourier 
discretization |1 H or other subspace approximations (compact splines or wavelets) ^3]- ^ n each of 
these methods the path integral is reduced to a high (but finite) dimensional Riemann integral, which is 
approximated using a Monte Carlo method. These more advanced discretizations can be combined with 
any of the mentioned MC methods. 

I have not discussed the variational approximation in this paper. This approach to approximating the 
path integral is also known as variational perturbation theory and gives an expansion of the path integral 
in terms of the anharmonic interaction terms and a variational function that is to be optimized [14] . The 
lowest term in the expansion is similar to what is known as the variational approximation in machine 
learning using the Jensen's bound |15j . but one can also consider higher order terms. The expansion is 
around a tractable dynamics, such as for instance the harmonic oscillator, whose variational parameters 
are optimized such as to best approximate the path integral. The application of this method to optimal 
control would be the topic of another paper. A complication of such an analytic treatment is the presence 
of topological constraints, such as walls and obstacles. 

There exist other fields of research that use path integrals and where dedicated numerical methods 
have been developed to solve them. For instance, in chemical physics path integrals are used to describe 
conformational changes in molecules over large time scales. The problem is similar to an optimal control 
problem such as navigating a maze: The begin and end positions are known, and one or more path of 
minimal cost needs to be found. A prominent method in this field is transition path sampling |16j , which 
can be viewed as a Metropolis-Hasting sampling scheme in path space, where a new path is sampled 
by changing part of the current path and accepting the new path with a probability. This approach is 
probably also suitable for optimal control. 

There is a superficial relation between the work presented in this paper and the body of work that seeks 
to find a particle interpretation of quantum mechanics. In fact, the log transformation was motivated 
from that work. Madelung ^7] observed that if \& = ^/pexp(iJ/h) is the wave function that satisfies the 
Schrodinger equation, p and J satisfy two coupled equations. One equation describes the dynamics of p 
as a Fokker -Planck equation. The other equation is a Hamilton-Jacobi equation for J with an additional 
term, called the quantum-mechanical potential which involves p. Nelson showed that these equations 
describe a stochastic dynamics in a force field given by the VJ, where the noise is proportional to h 

[HUH. 

Comparing this to the relation = exp(— J/A) used in this paper, we see that A plays the role of h as 
in the QM case. However, the big difference is that there is only one real valued equation, and not two 
as in the quantum mechanical case. In the control case, p is computed as an alternative to computing 
the HJB equation. In the QM case, the dynamics of p and J are computed together. The QM density 
evolution is non-linear in p because the drift force that enters the Fokker-Planck equation depends on p 
through J as computed from the HJ equation. 
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